feat(temporal): opt-in continue-as-new for long-lived agent workflows #447

Open
danielmillerp wants to merge 1 commit into next from dm/temporal-continue-as-new
@danielmillerp danielmillerp commented Jun 24, 2026

Contributor

Why

Long-lived chat/session agents (e.g. the Emu/FDD researcher) run as a single Temporal workflow that stays open indefinitely. Their event history grows until it hits Temporal's ~50k-event / 50MB limit and the workflow stalls — this is the root cause behind "chats die / state outgrows the 2MB payload" (P0 for EY).

This PR adds an opt-in continue-as-new path so a session can stay open forever by recycling its history, plus the discipline of keeping messages/state outside workflow state so they survive the recycle.

Two orthogonal levers: continue-as-new bounds history size; the chain-wide WORKFLOW_EXECUTION_TIMEOUT_SECONDS bounds wall-clock lifetime. continue-as-new does not extend the execution timeout — raise that knob too to keep workflows long-lived.

SDK — BaseWorkflow helpers (opt-in)

  • should_continue_as_new() — recycle decision: Temporal's is_continue_as_new_suggested() or a configurable WORKFLOW_MAX_HISTORY_LENGTH threshold.
  • drain_and_continue_as_new() — waits all_handlers_finished (so an in-flight turn isn't lost/duplicated at the boundary), then continue_as_new.
  • run_until_complete() — drop-in replacement for the usual wait_condition(timeout=None) tail. Gated once behind workflow.patched() so in-flight pre-patch workflows keep the old behaviour and don't hit a non-determinism error on replay.
  • conversation_from_messages() — rebuild the conversation from the adk.messages ledger after a recycle (messages live in adk.messages, not workflow state).
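
The recycle decision described above can be sketched as a pure function. This is a minimal illustration, not the PR's implementation: `should_recycle` and its parameters are hypothetical names, the `>=` comparison is an assumption, and gating on the enabled flag inside the decision is also an assumption. In a real workflow the two runtime inputs would presumably come from the Temporal Python SDK's `workflow.info().is_continue_as_new_suggested()` and `workflow.info().get_current_history_length()`.

```python
from typing import Optional


def should_recycle(
    enabled: bool,
    suggested_by_server: bool,
    history_length: int,
    max_history_length: Optional[int] = None,
) -> bool:
    """Return True when the workflow should continue-as-new.

    Hypothetical stand-in for should_continue_as_new(); the parameter names
    and the >= comparison are assumptions made for this sketch.
    """
    if not enabled:
        # Default-off: agents that never set the flag never recycle.
        return False
    if suggested_by_server:
        # The server already considers the history large enough to recycle.
        return True
    # Optional explicit threshold, checked only when one is configured.
    return max_history_length is not None and history_length >= max_history_length
```

Keeping the decision free of SDK calls makes it straightforward to unit-test without a Temporal test server.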

Config (default OFF — existing agents unaffected)

  • WORKFLOW_CONTINUE_AS_NEW_ENABLED (bool)
  • WORKFLOW_MAX_HISTORY_LENGTH (int | None)
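
One way to read these two settings from the environment is sketched below. Only the two variable names come from this PR; the `ContinueAsNewConfig` dataclass, the `load_config` helper, and the accepted truthy strings are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ContinueAsNewConfig:
    """Hypothetical container for the two settings; both default to 'off'."""

    enabled: bool = False
    max_history_length: Optional[int] = None


def load_config(env: Mapping[str, str]) -> ContinueAsNewConfig:
    """Parse the settings from an environment-like mapping."""
    raw_enabled = env.get("WORKFLOW_CONTINUE_AS_NEW_ENABLED", "").strip().lower()
    raw_max = env.get("WORKFLOW_MAX_HISTORY_LENGTH", "").strip()
    return ContinueAsNewConfig(
        # Assumption: these strings count as "true"; anything else is false.
        enabled=raw_enabled in {"1", "true", "yes"},
        # An unset or empty value means "no explicit threshold".
        max_history_length=int(raw_max) if raw_max else None,
    )
```

Passing `os.environ` as `env` would read the process environment; passing a plain dict keeps tests hermetic.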

Examples

All 13 long-lived Temporal tutorial agents adopt run_until_complete:

  • Message-based chat (010, 050, 060, 070, 080, 100, 120) — rebuild conversation from adk.messages.
  • Harness/session — persist non-message state to adk.state and re-hydrate on recycle: opaque session handles for claude-sdk (090), claude-code (140), codex (150); rich ModelMessage history for pydantic-ai (110, via ModelMessagesTypeAdapter); langgraph (130) rebuilds from the ledger.
  • 000 (no per-turn state) just swaps the wait.

Every adk.state / adk.messages round-trip is guarded by the enabled flag, so the default path is byte-for-byte unchanged.
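
The guarded round-trip can be illustrated with an in-memory stand-in. Everything here is hypothetical: `InMemoryStateStore` is not the real `adk.state` client, and `persist_session` / `restore_session` are names invented for the sketch. The point is only that no store call happens while the flag is off.

```python
import asyncio
from typing import Dict, Optional


class InMemoryStateStore:
    """Hypothetical stand-in for adk.state; counts every call it receives."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}
        self.calls = 0

    async def save(self, task_id: str, state: dict) -> None:
        self.calls += 1
        self._data[task_id] = dict(state)

    async def load(self, task_id: str) -> Optional[dict]:
        self.calls += 1
        return self._data.get(task_id)


async def persist_session(store, task_id: str, session_id: str, enabled: bool) -> None:
    if not enabled:
        return  # default path: no external round-trip at all
    await store.save(task_id, {"session_id": session_id})


async def restore_session(store, task_id: str, enabled: bool) -> Optional[str]:
    if not enabled:
        return None
    state = await store.load(task_id)
    return state["session_id"] if state else None


async def demo():
    store = InMemoryStateStore()
    # Flag off: neither helper touches the store.
    await persist_session(store, "task-1", "sess-abc", enabled=False)
    off = await restore_session(store, "task-1", enabled=False)
    calls_when_off = store.calls
    # Flag on: the opaque handle survives a simulated recycle.
    await persist_session(store, "task-1", "sess-abc", enabled=True)
    on = await restore_session(store, "task-1", enabled=True)
    return off, calls_when_off, on
```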

Verification

  • New unit tests for the recycle decision logic — tests/lib/core/temporal/test_base_workflow_continue_as_new.py (5 passing).
  • Full tests/lib/core/temporal suite: 8 passed, no regressions.
  • py_compile + ruff clean across all 16 changed files.

Follow-ups (not in this PR)

  • Replay/integration test of drain_and_continue_as_new against a Temporal test server.
  • Validate the pattern (drain + patch + chain-timeout) with the Temporal team before enabling in production.
  • Optional platform-level "transparent for all agents" variant (SDK owns the run loop) — deferred per discussion.

🤖 Generated with Claude Code

Greptile Summary

This PR adds opt-in continue-as-new support for long-running Temporal agent workflows. The main changes are:

  • BaseWorkflow helpers for recycle decisions, handler draining, and continued-run detection.
  • Conversation rehydration from adk.messages after workflow history is recycled.
  • adk.state persistence for agents with opaque session handles or non-text model history.
  • Tutorial workflow updates to use the new run_until_complete path.
  • New environment flags and tests for continue-as-new behavior.

Confidence Score: 4/5

The continue-as-new support is mostly contained and opt-in, but the conversation restore path needs attention before relying on recycled workflows.

The main implementation and tests cover the recycle decision logic, and the remaining issue is localized to message pagination during conversation rehydration.

src/agentex/lib/core/temporal/workflows/workflow.py

What T-Rex did

  • Reproduced the zero-based pagination bug by running a focused script against the real BaseWorkflow.conversation_from_messages implementation, using a fake continued workflow run and a fake adk.messages.list that returns ledger messages only for page 0. The fake messages client logged HTTP 200 OK for page_number=1 as the first and only restore request, which shows that page 0 was skipped. The restored conversation was empty even though two text ledger messages existed on page 0, and the script failed on its assertion that the first requested page should be 0.
  • T-Rex ran the requested verification, but its local artifact references were not uploaded.
  • Before-and-after guard-state runs show progress. Before the change, the guard exercise exits with code 1 because run_until_complete is missing and legacy wait_condition lines remain. After the change, it exits with code 0, with per-file run_until_complete lines and guarded adk.state calls. Supplemental codex-focused logs show an attempt at focused test execution that was blocked by errors such as uv: not found and No module named pytest.

T-Rex Ran code and verified through T-Rex

Reviews (8): Last reviewed commit: "feat(temporal): opt-in continue-as-new f..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

@danielmillerp danielmillerp changed the base branch from main to next June 24, 2026 20:20
@danielmillerp danielmillerp force-pushed the dm/temporal-continue-as-new branch from 5d63a08 to 4170651 Compare June 24, 2026 20:22
socket-security Bot commented Jun 24, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

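| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Updated | pypi/agentex-sdk@0.13.0 ⏵ 0.14.0 | 94 +1 | 100 | 100 | 100 | 100 |
| Updated | pypi/agentex-client@0.13.0 ⏵ 0.15.0 | 99 +1 | 100 | 100 +1 | 100 | 100 |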

Comment thread src/agentex/lib/core/temporal/workflows/workflow.py Outdated
@danielmillerp danielmillerp force-pushed the dm/temporal-continue-as-new branch from 4170651 to 891ef6d Compare June 24, 2026 21:07
Comment thread src/agentex/lib/core/temporal/workflows/workflow.py Outdated
Comment thread examples/tutorials/10_async/10_temporal/150_codex/project/workflow.py Outdated
@danielmillerp danielmillerp force-pushed the dm/temporal-continue-as-new branch from 1fb74c4 to 22e7358 Compare June 26, 2026 17:55
Comment thread src/agentex/lib/core/temporal/workflows/workflow.py Outdated
@danielmillerp danielmillerp force-pushed the dm/temporal-continue-as-new branch from 22e7358 to ad68bd8 Compare June 26, 2026 18:11
Comment thread src/agentex/lib/core/temporal/workflows/workflow.py Outdated
@danielmillerp danielmillerp force-pushed the dm/temporal-continue-as-new branch from ad68bd8 to 65ab89a Compare June 26, 2026 18:27
Comment thread src/agentex/lib/core/temporal/workflows/workflow.py Outdated
@danielmillerp danielmillerp force-pushed the dm/temporal-continue-as-new branch from 65ab89a to 43f62c2 Compare June 26, 2026 18:44
Long-lived chat/session agents run as a single Temporal workflow that stays
open indefinitely, so their event history grows until it hits Temporal's
~50k-event / 50MB limit and the workflow stalls. This adds an opt-in
continue-as-new path that recycles the history so a session can stay open
forever, plus the discipline of keeping messages/state outside workflow state
so they survive the recycle.

SDK (BaseWorkflow):
- should_continue_as_new(): recycle decision (Temporal's
  is_continue_as_new_suggested() or a configurable WORKFLOW_MAX_HISTORY_LENGTH
  threshold).
- drain_and_continue_as_new(): waits all_handlers_finished (so an in-flight
  turn is never lost/duplicated at the boundary) then continue_as_new.
- run_until_complete(): drop-in replacement for the usual
  wait_condition(timeout=None) tail; gated once behind workflow.patched() so
  in-flight pre-patch workflows keep the old behaviour (no non-determinism on
  replay). Identical behaviour unless WORKFLOW_CONTINUE_AS_NEW_ENABLED is set.
- conversation_from_messages(): rebuild the conversation from the adk.messages
  ledger after a recycle (messages live in adk.messages, not workflow state).
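
The drain-then-recycle ordering can be simulated with plain asyncio. This is a sketch under stated assumptions, not the SDK code: `SessionWorkflow` and the `ContinueAsNew` exception are hypothetical, and the `asyncio.Event` stands in for waiting on Temporal's `workflow.all_handlers_finished` before calling `workflow.continue_as_new`.

```python
import asyncio


class ContinueAsNew(Exception):
    """Hypothetical marker for this sketch; not a Temporal SDK type."""


class SessionWorkflow:
    def __init__(self) -> None:
        self.in_flight = 0
        self.completed_turns = []
        self._idle = asyncio.Event()
        self._idle.set()  # no handlers are running at start

    async def handle_turn(self, text: str) -> None:
        self.in_flight += 1
        self._idle.clear()
        try:
            await asyncio.sleep(0.01)  # pretend to call the model
            self.completed_turns.append(text)
        finally:
            self.in_flight -= 1
            if self.in_flight == 0:
                self._idle.set()

    async def drain_and_continue_as_new(self) -> None:
        # Wait until every in-flight turn has finished, then recycle.
        await self._idle.wait()
        raise ContinueAsNew(len(self.completed_turns))


async def demo() -> int:
    wf = SessionWorkflow()
    turn = asyncio.create_task(wf.handle_turn("hello"))
    await asyncio.sleep(0)  # let the turn start before draining
    try:
        await wf.drain_and_continue_as_new()
    except ContinueAsNew as recycle:
        await turn
        return recycle.args[0]  # turns completed before the recycle
    return -1
```

Because the recycle is raised only after the idle event is set, the in-flight turn is counted exactly once on the old run rather than being dropped or replayed on the new one.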

Config (default off, so existing agents are unaffected):
- WORKFLOW_CONTINUE_AS_NEW_ENABLED (bool)
- WORKFLOW_MAX_HISTORY_LENGTH (int|None)

Examples: all 13 long-lived Temporal tutorial agents adopt run_until_complete.
Message-based chat agents rebuild conversation from adk.messages; harness
agents with an opaque session handle (claude-code, codex, claude-sdk) or rich
history (pydantic-ai via ModelMessagesTypeAdapter, langgraph) persist their
non-message state to adk.state and re-hydrate on recycle. Every adk.state /
adk.messages round-trip is guarded by the enabled flag, so the default path is
byte-for-byte unchanged.

Note: continue-as-new bounds history SIZE; it does NOT extend the chain-wide
WORKFLOW_EXECUTION_TIMEOUT_SECONDS (raise that to keep workflows long-lived).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@danielmillerp danielmillerp force-pushed the dm/temporal-continue-as-new branch from 43f62c2 to 9d71bb7 Compare June 26, 2026 18:59
Comment on lines +244 to +251
messages = []
page_number = 1
while True:
    page = await adk.messages.list(
        task_id=task_id,
        limit=_CONVERSATION_PAGE_SIZE,
        page_number=page_number,
    )

P1 Use zero-based pages

The messages SDK's offset pagination is zero-based, but this restore loop starts at 1. When a workflow continues as new, chats with 200 or fewer ledger messages fetch the empty second page and restore an empty conversation. Longer chats drop the first 200 messages from model context. Please start from page 0, or otherwise match the messages API page base, before rebuilding the conversation.
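
A zero-based restore loop along the lines the comment asks for might look like the sketch below. It does not call the real `adk.messages.list`: `make_fake_ledger` is a hypothetical zero-based stand-in, `fetch_all_messages` is an illustrative name, the page size of 200 is inferred from the comment, and stopping on a short page is an assumption about how the end of the ledger is detected.

```python
import asyncio
from typing import Awaitable, Callable, List

PAGE_SIZE = 200  # inferred from the 200-message figure in the review comment

ListFn = Callable[[int, int], Awaitable[List[str]]]


async def fetch_all_messages(list_page: ListFn, limit: int = PAGE_SIZE) -> List[str]:
    """Collect every ledger message, starting from page 0."""
    messages: List[str] = []
    page_number = 0  # zero-based: the first page is page 0, not 1
    while True:
        page = await list_page(limit, page_number)
        messages.extend(page)
        if len(page) < limit:
            # A short (or empty) page means the ledger is exhausted.
            return messages
        page_number += 1


def make_fake_ledger(total: int) -> ListFn:
    """Hypothetical zero-based paginated source holding `total` messages."""
    ledger = [f"msg-{i}" for i in range(total)]

    async def list_page(limit: int, page_number: int) -> List[str]:
        start = page_number * limit
        return ledger[start : start + limit]

    return list_page
```

With this fake source, a two-message chat is restored in full from page 0, and a 450-message chat keeps its first 200 messages instead of dropping them.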

Artifacts

Repro: focused BaseWorkflow zero-based pagination script

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Stack trace captured during the T-Rex run

  • Keeps the raw stack trace available without making the summary code-heavy.
